Abstract. It is now well known that decentralised optimisation can be
formulated as a potential game, and game-theoretical learning algorithms
can be used to find an optimum. One of the most common learning techniques
in game theory is fictitious play. However, fictitious play is founded
on an implicit assumption that opponents' strategies are stationary. We
present a novel variation of fictitious play that allows the use of a more
realistic model of opponent strategy. It uses a heuristic approach, from
the online streaming data literature, to adaptively update the weights
assigned to recently observed actions. We compare the results of the
proposed algorithm with those of stochastic and geometric fictitious play in
a simple strategic form game, a vehicle target assignment game and a
disaster management problem. In all the tests the rate of convergence of the
proposed algorithm was similar to or better than that of the variations of
fictitious play we compared it with. The new algorithm therefore improves the
performance of game-theoretical learning in decentralised optimisation.



1 Introduction 

Decentralised optimisation is a crucial component of sensor networks [1,2],
disaster management [3], traffic control [4] and scheduling [5]. In each of these
domains a combination of computational and communication complexity renders
centralised optimisation approaches intractable. It is now well known that many
decentralised optimisation problems can be formulated as a potential game [6-8].
Hence the optimisation problem can be recast in terms of finding a Nash
equilibrium of a potential game. An iterative decentralised optimisation algorithm
can therefore be considered a type of learning in games algorithm, and vice versa.

Fictitious play is the canonical example of learning in games [9]. Under
fictitious play each player maintains some beliefs about his opponents' strategies,
and based on these beliefs he chooses the action that maximises his expected
reward. The players then update their beliefs about opponents' strategies after
observing their actions. Fictitious play converges to Nash equilibrium for certain
kinds of games [9, 10] but in practice this convergence can be very slow. This is
because it implicitly assumes that the other players use a fixed strategy
throughout the game, giving the same weight to every observed action.

In [11] this problem was addressed by using particle filters to predict
opponents' strategies. The drawback of this approach is the computational cost of
the particle filters, which makes the method difficult to apply in real-life
applications.

In this paper we propose an alternative method which uses a heuristic rule 
to adapt the weights of opponents' strategies by taking into account their re- 
cent actions. We observe empirically that this approach reduces the number of 
steps that fictitious play needs to converge to a solution, and hence the commu- 
nications overhead between the distributed optimisers that is required to find 
a solution to the distributed optimisation problem. In addition the
computational demand of the proposed algorithm is similar to that of the classic
fictitious play algorithm.

The remainder of this paper is organised as follows. We start with a brief 
description of game theory, fictitious play and stochastic fictitious play. Section 
3 introduces adaptive forgetting factor fictitious play (AFFFP). The impact of 
the algorithm's parameters on its performance is studied in Section 4. Section 
5 presents the results of AFFFP for a climbing hill game, a vehicle target as- 
signment game and a disaster management simulation scenario. We finish with 
a conclusion. 

2 Background 

In this section we introduce the relationship between potential games and decen- 
tralised optimisation, as well as the classical fictitious play learning algorithm. 

2.1 Potential games and decentralised optimisation 

A class of games which maps naturally to the decentralised optimisation frame- 
work is strategic form games. The elements of a strategic form game are [12] 

- a set of players $1, 2, \ldots, I$,

- a set of actions $s^i \in S^i$ for each player $i \in \{1, 2, \ldots, I\}$,

- a set of joint actions, $s = (s^1, s^2, \ldots, s^I) \in S^1 \times S^2 \times \cdots \times S^I = S$,

- a payoff function $u^i : S \to \mathbb{R}$ for each player $i$, where $u^i(s)$ is the utility that
player $i$ will gain after a specific joint action $s$ has been played.

We will often write $s = (s^i, s^{-i})$, where $s^i$ is the action of Player $i$ and $s^{-i}$ is the
joint action of Player $i$'s opponents.

The rules that the players use to select the action that they will play in a game
are called strategies. A player $i$ chooses his actions according to a pure strategy
when he selects his actions by using a deterministic rule. When he chooses an
action based on a probability distribution he acts according to a mixed
strategy. If we denote the set of all the probability distributions over the
action space $S^i$ as $\Delta^i$, then a mixed strategy of player $i$ is an element $\sigma^i \in \Delta^i$.
We define $\Delta$ as the set product of all $\Delta^i$, $\Delta = \Delta^1 \times \ldots \times \Delta^I$. Then the joint mixed
strategy $\sigma = (\sigma^1, \ldots, \sigma^I)$ is defined as an element of $\Delta$ and we will often write
$\sigma = (\sigma^i, \sigma^{-i})$ analogously to $s = (s^i, s^{-i})$. We will denote the expected utility
a player $i$ will gain if he chooses a strategy $\sigma^i$ (resp. $s^i$), when his opponents
choose the joint strategy $\sigma^{-i}$, as $u^i(\sigma^i, \sigma^{-i})$ (resp. $u^i(s^i, \sigma^{-i})$).

Many decision rules can be used by the players to choose their actions in 
a game. One of them is to choose their actions from a set of mixed strategies 
that maximises their expected utility given their beliefs about their opponents' 
strategies. When Player $i$'s opponents' strategies are $\sigma^{-i}$ then the best response
of player $i$ is defined as:

$$BR^i(\sigma^{-i}) = \arg\max_{\sigma^i \in \Delta^i} u^i(\sigma^i, \sigma^{-i}). \tag{1}$$

Nash [13], based on Kakutani's fixed point theorem, showed that every game
has at least one equilibrium. This equilibrium is a strategy $\hat{\sigma}$ that is a fixed point
of the best response correspondence, $\hat{\sigma}^i \in BR^i(\hat{\sigma}^{-i})$ for all $i$. Thus when a joint mixed
strategy $\hat{\sigma}$ is a Nash equilibrium then

$$u^i(\hat{\sigma}^i, \hat{\sigma}^{-i}) \geq u^i(s^i, \hat{\sigma}^{-i}) \quad \text{for all } i, \text{ for all } s^i \in S^i. \tag{2}$$

Equation (2) implies that if a strategy $\hat{\sigma}$ is a Nash equilibrium then it is not
possible for a player to increase his utility by unilaterally changing his strategy.
When all the players in a game select equilibrium actions using pure strategies
then the equilibrium is referred to as a pure strategy Nash equilibrium.

A particularly useful category of games for multi-agent decision problems is 
the class of potential games [10, 8, 7]. The utility function of an exact potential 
game satisfies the following property: 

$$u^i(s^i, s^{-i}) - u^i(\tilde{s}^i, s^{-i}) = \phi(s^i, s^{-i}) - \phi(\tilde{s}^i, s^{-i}) \tag{3}$$

where $\phi$ is a potential function and the above equality holds for every player $i$,
for every joint action $s^{-i} \in S^{-i}$, and for every pair of actions $s^i, \tilde{s}^i \in S^i$. The
potential function depicts the changes in the players' payoffs when they unilaterally
change their actions. Every potential game has at least one pure strategy Nash
equilibrium [10]. There may be more than one, but at any equilibrium no player
can increase their reward, and therefore the potential function, through a unilateral
deviation.

Wonderful life utility [6, 7] is a method to design the individual utility func- 
tions of a potential game such that the global utility function of a decentralised 
optimisation problem acts as the potential function. Player $i$'s utility when a
joint action $s = (s^i, s^{-i})$ is performed is the difference between the global utility
obtained when the player selects action $s^i$ and the global utility that would
have been obtained if $i$ had selected an (arbitrarily chosen) reference action $s^i_0$:

$$u^i(s^i, s^{-i}) = u_g(s^i, s^{-i}) - u_g(s^i_0, s^{-i}) \tag{4}$$

where $u_g$ is the global utility function. Hence the decentralised optimisation
problem can be cast as a potential game, and any algorithm that is proved to
converge to Nash equilibria will converge to a joint action from which no player
can increase the global reward through unilateral deviation.
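
The link between wonderful life utility and the potential property (3) can be checked numerically on a small example. The sketch below is only an illustration: the global utility (the number of distinct targets covered), the reference action 0 and all function names are assumptions, not taken from the cited works.

```python
import itertools

def global_utility(joint_action):
    # Hypothetical global objective: number of distinct targets covered.
    return len(set(joint_action))

def wonderful_life_utility(player, joint_action, reference_action=0):
    # Equation (4): global utility minus the global utility obtained when
    # `player` switches to the reference action and the others stay put.
    baseline = list(joint_action)
    baseline[player] = reference_action
    return global_utility(joint_action) - global_utility(tuple(baseline))

def is_exact_potential(n_players, n_actions):
    # Check property (3) with the global utility acting as the potential.
    for s in itertools.product(range(n_actions), repeat=n_players):
        for i in range(n_players):
            for a in range(n_actions):
                s_dev = s[:i] + (a,) + s[i + 1:]
                du = wonderful_life_utility(i, s) - wonderful_life_utility(i, s_dev)
                dphi = global_utility(s) - global_utility(s_dev)
                if du != dphi:
                    return False
    return True
```

The reference term cancels in every unilateral deviation, which is why the check succeeds for any choice of reference action.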



2.2 Fictitious Play 

Fictitious play is a widely used learning technique in game theory. In fictitious
play each player chooses his action according to the best response to his beliefs
about his opponents' strategies.

Initially each player has some prior beliefs about the strategies that his
opponents use to choose actions. The players, after each iteration, update their beliefs
about their opponents' strategies and play again the best response according to
their beliefs. More formally, at the beginning of a game players maintain some
arbitrary non-negative initial weight functions $\kappa^j_0$, $j = 1, \ldots, I$, that are updated
using the formula:

$$\kappa^j_t(s^j) = \kappa^j_{t-1}(s^j) + I_{s^j_t = s^j} \tag{5}$$

for each $j$, where $I_{s^j_t = s^j} = 1$ if $s^j_t = s^j$ and $0$ otherwise.

The mixed strategy of opponent $j$ is estimated from the following formula:

$$\sigma^j_t(s^j) = \frac{\kappa^j_t(s^j)}{\sum_{s' \in S^j} \kappa^j_t(s')} \tag{6}$$

Equations (5) and (6) are equivalent to:

$$\sigma^j_t(s^j) = \left(1 - \frac{1}{t^j}\right) \sigma^j_{t-1}(s^j) + \frac{1}{t^j} I_{s^j_t = s^j} \tag{7}$$

where $t^j = t + \sum_{s' \in S^j} \kappa^j_0(s')$. Player $i$ chooses an action which maximises his
expected payoffs given his beliefs about his opponents' strategies.
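
As a small illustration of why the count-based update (5)-(6) and the recursive update (7) agree, the following sketch runs both side by side. The action labels, the prior weights and the function names are assumptions chosen for the example.

```python
def update_weights(weights, observed_action):
    # Equation (5): add one to the weight of the action just observed.
    new_weights = dict(weights)
    new_weights[observed_action] += 1.0
    return new_weights

def beliefs_from_weights(weights):
    # Equation (6): normalise the weights into a mixed-strategy estimate.
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def update_beliefs_directly(beliefs, observed_action, t_j):
    # Equation (7), with t_j = t + (sum of the initial weights).
    return {a: (1 - 1 / t_j) * p + (1 / t_j) * (1.0 if a == observed_action else 0.0)
            for a, p in beliefs.items()}

weights = {"U": 1.0, "D": 1.0}            # assumed prior weights kappa_0
initial_total = sum(weights.values())
beliefs = beliefs_from_weights(weights)
for t, action in enumerate(["U", "U", "D", "U"], start=1):
    weights = update_weights(weights, action)
    beliefs = update_beliefs_directly(beliefs, action, t + initial_total)
```

After the loop, `beliefs` and `beliefs_from_weights(weights)` coincide up to floating-point rounding.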

The main purpose of a learning algorithm like fictitious play is to converge 
to a set of strategies that are a Nash equilibrium. For classic fictitious play (7) 
it has been proved [9] that if a is a strict Nash equilibrium and it is played 
at time t then it will be played for all further iterations of the game. Also any 
steady state of fictitious play is a Nash equilibrium. Furthermore, it has been 
proved that fictitious play converges for 2 x 2 games with generic payoffs [14], 
zero sum games [15], games that can be solved using iterative dominance [16] 
and potential games [10]. There are also games where fictitious play does not 
converge to a Nash equilibrium. Instead it can become trapped in a limit cycle 
whose period is increasing through time. An example of such a game is Shapley's 
game [17]. 

A player $i$ that uses the classic fictitious play algorithm uses best responses to
his beliefs to choose his actions, so he chooses his actions $s^i$ from his pure strategy
space $S^i$. Randomisation is allowed only in the case that a player is indifferent
between his available actions, but it is very rare in generic payoff strategic form
games for a player to be indifferent between the available actions [18]. Stochastic
fictitious play is a variation of fictitious play where players use mixed strategies
in order to choose actions. This variation was originally introduced to allow
convergence of players' strategies to a mixed strategy Nash equilibrium [9] but
has the additional advantage of introducing exploration into the process.

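The most common form of smooth best response of Player $i$, $\overline{BR}^i(\sigma^{-i})$, is
the following [9]:

$$\overline{BR}^i(\sigma^{-i})(s^i) = \frac{\exp\left(u^i(s^i, \sigma^{-i})/\xi\right)}{\sum_{\tilde{s}^i \in S^i} \exp\left(u^i(\tilde{s}^i, \sigma^{-i})/\xi\right)} \tag{8}$$

where $\xi$ is the randomisation parameter. When the value of $\xi$ is close to zero,
$\overline{BR}$ is close to $BR$ and players exploit their action space, whereas large values
of $\xi$ result in complete randomisation [9].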
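
A minimal sketch of the smooth best response, assuming the expected utilities of Player i's actions have already been computed and are passed in as a list. The function name and the example utilities are illustrative only.

```python
import math

def smooth_best_response(expected_utilities, xi):
    # Boltzmann (softmax) distribution over expected utilities, as in (8).
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    u_max = max(expected_utilities)
    exps = [math.exp((u - u_max) / xi) for u in expected_utilities]
    total = sum(exps)
    return [e / total for e in exps]

almost_greedy = smooth_best_response([1.0, 2.0, 3.0], xi=0.01)
almost_uniform = smooth_best_response([1.0, 2.0, 3.0], xi=1e6)
```

With a small randomisation parameter nearly all probability falls on the best action, while a very large one gives an almost uniform distribution.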

Stochastic fictitious play is a modification of fictitious play under which
Player $i$ uses $\overline{BR}^i(\sigma^{-i}_t)$ to randomly select an action instead of selecting a best
response action $BR^i(\sigma^{-i}_t)$. We emphasise that the difference between classic
fictitious play and stochastic fictitious play is in the decision rule the players
use to choose the actions. The updating rule (7) that is used to update the beliefs
about the opponents' strategies is the same in both algorithms.

When Player i uses equation (7) to update the beliefs about opponents' 
strategies he treats the environment of the game as stationary and implicitly 
assumes that the actions of the players are sampled from a fixed probability 
distribution [9]. Therefore recent observations have the same weight as initial 
ones. This approach leads to poor adaptation when other players change their 
strategies. 

A variation of fictitious play that treats the opponents' strategies as dynamic
and places greater weights on recent observations when calculating each action's
probability is geometric fictitious play, introduced in [9]. According to this
variation of fictitious play the estimate of each opponent's probability of playing an
action $s^j$ is evaluated using the formula:

$$\sigma^j_t(s^j) = (1 - z)\,\sigma^j_{t-1}(s^j) + z\, I_{s^j_t = s^j} \tag{9}$$

where $z \in (0, 1)$ is a constant.
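
The geometric update (9) can be sketched in a few lines. The action labels and the value of z are assumptions for the example.

```python
def geometric_update(beliefs, observed_action, z):
    # Equation (9): discount old beliefs with a constant z in (0, 1).
    return {a: (1 - z) * p + z * (1.0 if a == observed_action else 0.0)
            for a, p in beliefs.items()}

# The opponent was believed to play "A" for sure, then switches to "B".
beliefs = {"A": 1.0, "B": 0.0}
for _ in range(20):
    beliefs = geometric_update(beliefs, "B", z=0.1)
```

After k observations of "B" the belief in "A" has decayed to (1 - z)^k, so the estimate forgets old behaviour at a fixed geometric rate regardless of how long it was observed.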

In Section 3 we introduce a new variant of fictitious play in which the constant 
z is automatically adapted in response to the observations of opponent strategy. 



3 Adaptive forgetting factor fictitious play 

The objective of players when they maintain beliefs $\sigma^{-i}_t$ is to estimate the mixed
strategy of opponents. Now consider streaming data where in each time step
a new observation arrives and it belongs to one of $J$ available classes [19]. When
the objective is to estimate the probability of each class given the observed data,
this objective can be expressed as the fitting of a multinomial distribution to the
observed data stream. If a fixed multinomial distribution over time is assumed,
then its parameters can be estimated using the empirical frequencies of the
previously observed data. This is exactly the strategy estimation described in
Section 2.2. But in real-life applications it is rare to observe a data stream from a
constant distribution. Hence there is a need for the learning algorithm to adapt
to the distribution that the data currently follow. This is similar to iterative
games, where we expect that all players update their strategies simultaneously.

An approach that is widely used in the streaming data literature to handle 
changes in data distributions is forgetting. This suggests that recent observations 
have greater impact on the estimation of the algorithm's parameters than the 
older ones. Two methods of forgetting are commonly used: window based
methods and resetting. Salgado et al. [20] showed that when abrupt changes (jumps)
are detected then the optimal policy is to reset the parameters of the algorithm.
In the case of smooth changes (drift) in the data stream's distribution, a solution 
is to use only a segment of the data (a window). The simplest form of window 
based methods uses a specific segment size constantly; there are also approaches 
that adaptively change the size of the window but they are more complicated. 
Some examples of algorithms that use window based methods are [21-23]. 

Another method is to introduce forgetting, which is also used in geometric 
fictitious play (9), to discount the old information by giving higher weights to 
the recent observations. When the discount parameter is fixed it is necessary to 
know a priori the distribution of data and the way that they evolve through time 
due to the fact that we must choose the forgetting factor in advance. In addition 
the performance of the approximation when there are changes that result from a 
jump or non-constant drift is poor for a fixed forgetting factor. For those reasons 
this methodology has serious limitations. 

A more sophisticated solution is the use of a forgetting factor that takes
into account the recent data and the previously estimated parameters of the
model and adapts to observed changes in the data distribution. Such a forgetting
factor was proposed by Haykin [24] in the case of recursive least squares filtering
problems. In [24] the objective was the minimisation of the mean square error of a
cost function that depends on an exponential weighting factor $\lambda$. This forgetting
factor is then recursively updated by gradient descent on the residual errors of
the algorithm with respect to $\lambda$. Anagnostopoulos [19]
proposed a generalisation of this method in the context of online streaming data
from a generalised linear model according to which the forgetting factors are
adaptively changed by using gradient ascent of the log-likelihood of the new
data point.

In the streaming data context, after $t$ time intervals we observe a sequence of
data $x_1, \ldots, x_t$ and we fit a model $f(\theta_t \mid x_{1:t})$, where $\theta_t$ are the model's parameters
at time $t$. Note that the parameters of the model, $\theta_t$, depend on the observed data
stream $x_{1:t}$ and the forgetting factors $\lambda_t$. Since the estimated model parameters
depend on $\lambda_t$ we will write $\theta_t(\lambda_t)$. The log-likelihood of the data that will arrive
at time $t+1$, $x_{t+1}$, given the parameters of the model at time $t$ will be denoted as
$\mathcal{L}(x_{t+1}; \theta_t(\lambda_t))$. Then the update of the forgetting factor $\lambda_{t+1}$ can be expressed
as:

$$\lambda_{t+1} = \lambda_t + \gamma \frac{\partial \mathcal{L}(x_{t+1}; \theta_t(\lambda_t))}{\partial \lambda} \tag{10}$$

where $\gamma$ is the learning rate parameter of the gradient ascent algorithm.

As in [19], we can apply the forgetting factor of equation (10) in the case
of fitting a multinomial distribution to streaming data. This will result in a new
update rule that players can use, instead of the classic fictitious play update rule
(7), to maintain beliefs about opponents' strategies.

In classic fictitious play the weight function (5) places the same weight on
every observed action. In particular $\kappa^j_t(s^j)$ denotes the number of times that
player $j$ has played the action $s^j$ in the game. To introduce forgetting the impact
of the previously observed actions in the weight function will be discounted by
a factor $\lambda^j_{t-1}$. Such a weight function can be written as:

$$\kappa^j_t(s^j) = \lambda^j_{t-1} \kappa^j_{t-1}(s^j) + I_{s^j_t = s^j} \tag{11}$$

where $I_{s^j_t = s^j}$ is the same indicator function as in (7). To normalise we set
$n^j_t = \sum_{s^j \in S^j} \kappa^j_t(s^j)$. From the definition of $\kappa^j_t(s^j)$ we can use the following recursion
to evaluate $n^j_t$:

$$n^j_t = \lambda^j_{t-1} n^j_{t-1} + 1. \tag{12}$$

Then player $i$'s beliefs about his opponent $j$'s probability of playing action $s^j$
will be:

$$\sigma^j_t(s^j) = \frac{\kappa^j_t(s^j)}{n^j_t}. \tag{13}$$

Similarly to the case of geometric fictitious play $0 < \lambda_t < 1$. Moreover when
the value of $\lambda_t$ is close to zero this results in very fast adaptation and when
$\lambda_t = 0$ the players are myopic, and thus they respond to the last action of their
opponents. On the other hand when $\lambda_t = 1$ this results in the classic fictitious
play update rule.
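
The two limiting cases can be seen directly from a small sketch of the discounted update (11)-(13). The function name, action labels and prior weights are assumptions made for illustration.

```python
def discounted_update(weights, n, observed_action, lam):
    # Equation (11): discount every weight by lam, then count the new action.
    new_weights = {a: lam * w + (1.0 if a == observed_action else 0.0)
                   for a, w in weights.items()}
    # Equation (12): the normaliser obeys the same recursion.
    new_n = lam * n + 1.0
    # Equation (13): beliefs are the normalised weights.
    beliefs = {a: w / new_n for a, w in new_weights.items()}
    return new_weights, new_n, beliefs

prior = {"A": 1.0, "B": 1.0}
# lam = 1 reproduces the classic counts; lam = 0 keeps only the last action.
_, _, classic = discounted_update(prior, 2.0, "A", lam=1.0)
_, _, myopic = discounted_update(prior, 2.0, "B", lam=0.0)
```

Here `classic` assigns probability 2/3 to "A", exactly as rule (5)-(6) would, while `myopic` puts all probability on the last observed action "B".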

From this point on we will only consider inference over a single opponent
mixed strategy in fictitious play. In the case of multiple opponents separate
estimates are formed identically and independently for each opponent. We will
therefore drop all dependence on player $i$, and write $s_t$, $\sigma_t$ and $\kappa_t(s)$ for the
opponent's action, strategy and weight function respectively.

The value of $\lambda$ should be updated in order to have adaptive forgetting factors.
Initially we have to evaluate the log-likelihood of the recently observed action $s_t$
given the beliefs about the opponent's strategy. The log-likelihood is of the following
form:

$$\mathcal{L}(s_t; \sigma_{t-1}) = \ln \sigma_{t-1}(s_t) \tag{14}$$

When we replace $\sigma_{t-1}(s_t)$ with its equivalent from (13) the log-likelihood can
be written as:

$$\mathcal{L}(s_t; \sigma_{t-1}) = \ln\left(\frac{\kappa_{t-1}(s_t)}{n_{t-1}}\right) = \ln \kappa_{t-1}(s_t) - \ln n_{t-1} \tag{15}$$
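In order to estimate the update of $\lambda$, equation (10), the evaluation of the
log-likelihood's derivative with respect to $\lambda$ is required. The terms $\kappa_{t-1}(s_t)$ and $n_{t-1}$
both depend on $\lambda$. Hence the derivative of (15) is:

$$\frac{\partial \mathcal{L}(s_t; \sigma_{t-1})}{\partial \lambda} = \frac{1}{\kappa_{t-1}(s_t)} \frac{\partial}{\partial \lambda} \kappa_{t-1}(s_t) - \frac{1}{n_{t-1}} \frac{\partial}{\partial \lambda} n_{t-1} \tag{16}$$

Note that $\kappa_t(s) = \lambda_{t-1} \kappa_{t-1}(s) + I_{s_t = s}$, so

$$\frac{\partial}{\partial \lambda} \kappa_t(s) \Big|_{\lambda = \lambda_{t-1}} = \kappa_{t-1}(s) + \lambda_{t-1} \frac{\partial}{\partial \lambda} \kappa_{t-1}(s) \Big|_{\lambda = \lambda_{t-1}} \tag{17}$$

and similarly

$$\frac{\partial}{\partial \lambda} n_t \Big|_{\lambda = \lambda_{t-1}} = n_{t-1} + \lambda_{t-1} \frac{\partial}{\partial \lambda} n_{t-1} \Big|_{\lambda = \lambda_{t-1}} \tag{18}$$

We can use equations (17) and (18) to recursively estimate $\frac{\partial}{\partial \lambda} \kappa_{t-1}(s)$ for each
$s$ and $\frac{\partial}{\partial \lambda} n_{t-1}$, and hence calculate $\frac{\partial}{\partial \lambda} \mathcal{L}(s_t; \sigma_{t-1})$. Summarising, we can evaluate
the adaptive forgetting factor $\lambda_t$ as follows:

$$\lambda_t = \lambda_{t-1} + \gamma \left( \frac{1}{\kappa_{t-1}(s_t)} \frac{\partial}{\partial \lambda} \kappa_{t-1}(s_t) - \frac{1}{n_{t-1}} \frac{\partial}{\partial \lambda} n_{t-1} \right) \tag{19}$$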

To ensure that $\lambda_t$ remains in $(0, 1)$ we truncate it to this interval whenever
it leaves.

After updating their beliefs players can choose their actions by choosing
either a best response to their beliefs about their opponents' strategies or a smooth
best response. Table 1 summarises the algorithm of adaptive forgetting factor
fictitious play.



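At time $t$, each player carries out the following:

1. Updates the weights $\kappa^j_t(s^j) = \lambda^j_{t-1} \kappa^j_{t-1}(s^j) + I_{s^j_t = s^j}$.

2. Updates $\frac{\partial}{\partial \lambda} \kappa_{t-1}(s)$ and $\frac{\partial}{\partial \lambda} n_{t-1}$ using equations (17) and (18).

3. Based on the weights of step 1 each player updates his beliefs about his opponents'
strategies using $\sigma^j_t(s^j) = \kappa^j_t(s^j) / n^j_t$, where $n^j_t = \lambda^j_{t-1} n^j_{t-1} + 1$.

4. Chooses an action based on the beliefs of step 3 according either to best response,
$BR$, or to smooth best response, $\overline{BR}$.

5. Observes opponent's action $s^j_t$.

6. Updates the forgetting factor using:
$\lambda^j_t = \lambda^j_{t-1} + \gamma \left( \frac{1}{\kappa^j_{t-1}(s^j_t)} \frac{\partial}{\partial \lambda} \kappa^j_{t-1}(s^j_t) - \frac{1}{n^j_{t-1}} \frac{\partial}{\partial \lambda} n^j_{t-1} \right)$.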



Table 1. Adaptive forgetting factor fictitious play algorithm 
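
The belief-tracking part of the algorithm could be sketched as follows. This is not the authors' implementation: the class and attribute names, the strictly positive prior weights, and the clipping bounds `lam_min` and `lam_max` that stand in for truncation to the unit interval are all assumptions. The defaults for `lam0` and `gamma` are the values the article settles on in Section 4.

```python
class AFFFPTracker:
    """Tracks one opponent's mixed strategy with an adaptive forgetting factor."""

    def __init__(self, actions, lam0=0.8, gamma=1e-4, prior=1.0,
                 lam_min=1e-3, lam_max=1.0):
        self.lam, self.gamma = lam0, gamma
        self.lam_min, self.lam_max = lam_min, lam_max
        self.kappa = {a: prior for a in actions}    # weights kappa_t(s)
        self.n = prior * len(self.kappa)            # normaliser n_t
        self.dkappa = {a: 0.0 for a in actions}     # d kappa_t(s) / d lambda
        self.dn = 0.0                               # d n_t / d lambda

    def beliefs(self):
        # Equation (13): normalised weights.
        return {a: k / self.n for a, k in self.kappa.items()}

    def observe(self, action):
        # Log-likelihood gradient (16), using quantities from the previous step.
        grad = self.dkappa[action] / self.kappa[action] - self.dn / self.n
        # Gradient ascent step (19), clipped to the assumed interval.
        new_lam = min(self.lam_max, max(self.lam_min, self.lam + self.gamma * grad))
        # Derivative recursions (17) and (18), evaluated at the previous lambda.
        for a in self.kappa:
            self.dkappa[a] = self.kappa[a] + self.lam * self.dkappa[a]
        self.dn = self.n + self.lam * self.dn
        # Weight and normaliser updates (11) and (12).
        for a in self.kappa:
            self.kappa[a] = self.lam * self.kappa[a] + (1.0 if a == action else 0.0)
        self.n = self.lam * self.n + 1.0
        self.lam = new_lam

tracker = AFFFPTracker(["A", "B"])
for _ in range(50):
    tracker.observe("A")
```

A player would call `observe` once per iteration for each opponent and feed `beliefs()` into either a best response or a smooth best response.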
4 AFFFP parameters

The adaptive rule that we choose to update the forgetting factor $\lambda$ is based on
the gradient ascent algorithm. It is well known that poorly chosen initial values of
an algorithm's parameters and learning rates $\gamma$ can lead to poor results of the
gradient ascent algorithm [25]. This is because very small values of $\gamma$ lead to poor
exploration of the space and thus the gradient ascent algorithm can be trapped
in an area where the solution is not optimal, whereas large values of $\gamma$ can result
in big jumps that will lead the algorithm away from the area of the optimum
solution. Thus we should evaluate the performance of adaptive forgetting factor
fictitious play for different combinations of the step size parameter $\gamma$ and initial
values $\lambda_0$.

We employed a toy example, where a single opponent chooses his actions
using a mixed strategy which has a sinusoidal form, a situation which corresponds
to smooth changes in the data distribution of online streaming data. The
opponent uses a strategy of the following form over the $t = 1, 2, \ldots, 1000$ iterations
of the game: $\sigma_t(1) = \frac{\cos(2\pi t/\beta) + 1}{2} = 1 - \sigma_t(2)$, where $\beta = 1000$. We repeated this
example 100 times for each combination of $\gamma$ and $\lambda_0$. Each time we measured
the mean square error of the estimated strategy against the real one. The range
of $\gamma$ and $\lambda_0$ was $10^{-6} < \gamma < 10^{-1}$ and $10^{-1} < \lambda_0 < 1$ respectively.
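
To make the set-up concrete, the sketch below generates the sinusoidal opponent strategy and measures the mean square tracking error. To keep it short it uses a fixed forgetting factor as a stand-in for the adaptive rule; the prior pseudo-counts, the seed and the function names are assumptions.

```python
import math
import random

def drift_strategy(t, beta=1000):
    # Probability that the opponent plays action 1 at iteration t.
    return (math.cos(2 * math.pi * t / beta) + 1) / 2

def tracking_mse(lam, n_iter=1000, seed=0):
    rng = random.Random(seed)
    kappa1, n = 0.5, 1.0            # assumed prior: half a pseudo-count per action
    squared_errors = []
    for t in range(1, n_iter + 1):
        p1 = drift_strategy(t)
        action_is_1 = rng.random() < p1
        # Discounted weight and normaliser updates with a fixed lambda.
        kappa1 = lam * kappa1 + (1.0 if action_is_1 else 0.0)
        n = lam * n + 1.0
        squared_errors.append((kappa1 / n - p1) ** 2)
    return sum(squared_errors) / n_iter

mse_forgetting = tracking_mse(0.95)
mse_classic = tracking_mse(1.0)     # lambda = 1 is the classic update rule
```

Repeating `tracking_mse` over a grid of parameters and averaging over seeds gives the kind of error surface the contour plots summarise.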




(a) Contour plot of mean square error when the range of $\gamma$ and $\lambda_0$ is
$10^{-6} < \gamma < 10^{-1}$ and $10^{-1} < \lambda_0 < 1$ respectively.
(b) Contour plot of mean square error when the range of $\gamma$ and $\lambda_0$ is
$10^{-6} < \gamma < 5 \times 10^{-3}$ and $0.6 < \lambda_0 < 1$ respectively.

Fig. 1. Contour plot of mean square error

The average mean square error for all the combinations of $\gamma$ and $\lambda_0$ is depicted
in Figure 1. The mean square error is minimised in the dark area of the contour
plot. In Figure 1(a) we observe that when $\gamma$ is less than $10^{-3}$ and $\lambda_0$ is greater
than 0.6 the mean square error is minimised. In Figure 1(b) we reduce the range
of $\gamma$ and $\lambda_0$ to be $10^{-6} < \gamma < 5 \times 10^{-3}$ and $0.6 < \lambda_0 < 1$, respectively. Values of
$\lambda_0$ greater than 0.75 result in estimators with small mean square error, for certain
values of $\gamma$, and as the value of $\lambda_0$ approaches 0.95 it is minimised for a wider
range of learning rates. We also observe that when $\lambda_0$ is greater than 0.98 and
approaches 1 the mean square error increases. This suggests that for
values of $\lambda_0$ greater than 0.98 the value of $\lambda$ approaches 1 very fast and thus the
algorithm behaves like the classic fictitious play update rule. In contrast when
$\lambda_0$ is less than 0.75 we introduce big discounts to the previously observed
actions from the beginning of the game and the players easily use strategies
that are reactions to their opponent's randomisation. In addition, independently
of the initial value $\lambda_0$, when the learning rate $\gamma$ is greater than 0.001 the
algorithm results in poor estimates of the opponent's strategy. This is because
for $\gamma$ greater than 0.001 the step that a player moves towards the maximum of
the log-likelihood is very large and that results in values of $\lambda$ which are close
either to zero or to one. So the player uses either the classic fictitious play update
rule or responds to his opponent's last observed action.

We further examine the relationship between the performance of adaptive
forgetting factor fictitious play and the sequence of $\lambda$'s in the drift toy example
with respect to the initial values of the parameters $\lambda_0$ and $\gamma$. We use two
instances of the drift toy example. In the first one we set a fixed value of the parameter
$\gamma = 10^{-4}$ and examine the performance of our algorithm and the evolution of
$\lambda$ during the game for different values of $\lambda_0 \in \{0.55, 0.8, 0.9, 0.95, 0.99\}$. In the
second one we fix $\lambda_0$ and examine the results of the algorithm for different
values of $\gamma \in \{10^{-6}, 10^{-5}, 5 \cdot 10^{-4}, 10^{-4}, 10^{-3}\}$. Figures 2 and 3 depict the results
of the case of fixed $\gamma$ and $\lambda_0$, respectively. Each row of these figures consists
of two plots for the same set of parameters, $\gamma$ and $\lambda_0$. The left plot shows
the evolution of $\lambda$ during the game and the right one depicts the pre-specified
strategy of the opponent and its corresponding prediction.

As we observe in Figure 2, when we set $\lambda_0 = 0.55$ the tracking of the
opponent's strategy was affected by his randomisation. The value of $\lambda$ constantly
decreases, which results in giving higher weights to the recently observed actions
even if they are a consequence of randomisation. When we increase the value
of $\lambda_0$ to 0.8 the results improve. When we increase $\lambda_0$ to 0.90 or 0.95
the resulting sequence of $\lambda$'s does not affect the tracking of opponent's strategy.
On the other hand when we increase the value of $\lambda_0$ to 0.99 the value of $\lambda$ is
very close to 1 for many iterations, which results in poor approximation for the
same reasons that the classic fictitious play update rule fails to capture smooth
changes in opponent's strategy. When $\lambda_0$ is decreased to 0.9 the approximation
of opponent's strategy improves significantly.

Figure 3 depicts the results when $\lambda_0 = 0.95$ for different values of parameter
$\gamma$. We observe that high values of $\gamma$ ($\gamma = 10^{-3}$, $5 \cdot 10^{-4}$) result in big changes in
the value of $\lambda$ and that affects the quality of the approximation. On the other
hand when we use very small values of $\gamma$, $\gamma = 10^{-5}$ or $\gamma = 10^{-6}$, it leads to
very small deviations from $\lambda_0$. The good approximation results of opponent's
strategy that we observe for those two values of $\gamma$ are because of the initial value
$\lambda_0$. In this scenario if we fix the value of $\lambda = 0.95$ during the whole game we will
also have a good approximation. But in real-life applications it is impossible to
choose the value of $\lambda_0$ so efficiently. When $\gamma = 10^{-4}$ we observe changes in the
values of $\lambda$, which are not as sudden as the ones for $\gamma = 10^{-3}$ or $\gamma = 5 \cdot 10^{-4}$,
that lead to good approximation of opponent's strategy.




Fig. 2. Evolution of $\lambda$ (left) and tracking of $\sigma_t(1)$ (right) when the true strategies
are mixed and $\gamma$ is fixed. The pre-specified strategy of the opponent and its
prediction are depicted as the red and blue line respectively.



We also performed simulations for a case where jumps occur. In this
example we used a game of 1000 iterations and two available actions, 1 and 2.
The opponent played action 1 with probability $\sigma_t(1) = 1$ during the first 250
and the last 250 iterations of the game and for the remaining iterations of the
game $\sigma_t(1) = 0$. The probability of the second action can be calculated by using
$\sigma_t(2) = 1 - \sigma_t(1)$. The results for the case of fixed $\gamma$ and $\lambda_0$ are depicted in
Figures 4 and 5, respectively.
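
The jump scenario can be sketched in the same style. Again a fixed forgetting factor is used as a simplified stand-in for the adaptive rule, and the prior weights and function names are assumptions. The helper counts how many iterations after the first jump the belief in action 1 needs to fall below one half.

```python
def jump_strategy(t, n_iter=1000):
    # Action 1 is played for sure in the first and last quarter of the game.
    quarter = n_iter // 4
    return 1.0 if t <= quarter or t > n_iter - quarter else 0.0

def steps_to_adapt(lam, n_iter=1000):
    kappa1, n = 1.0, 1.0            # assumed prior: all weight on action 1
    first_jump = n_iter // 4
    for t in range(1, n_iter + 1):
        played_1 = jump_strategy(t, n_iter) == 1.0
        kappa1 = lam * kappa1 + (1.0 if played_1 else 0.0)
        n = lam * n + 1.0
        if t > first_jump and kappa1 / n < 0.5:
            return t - first_jump
    return None                      # never adapted within the game

fast = steps_to_adapt(0.9)
slow = steps_to_adapt(1.0)          # classic update rule
```

With a forgetting factor of 0.9 the estimate crosses one half within a handful of iterations, whereas the classic rule needs about as many iterations as it had already observed before the jump.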

When abrupt changes occur the different values of $\gamma$ do not affect the
performance of the algorithm as we observe in Figure 5. On the contrary the initial
value of $\lambda$ affects the estimation of the jumps in the opponent's strategies. As
we observe in Figure 4 the opponent's strategy approximation and the evolution
of $\lambda$ are similar when $\lambda_0$ is equal to 0.55, 0.8 and 0.9. In those three
cases when a jump is observed, there is a drop in the value of $\lambda$, then $\lambda$ slightly
increases and finally it remains constant until the next jump occurs.

The sequences of $\lambda$ and the approximation results of the last two cases,
$\lambda_0 = 0.95$ and $\lambda_0 = 0.99$, are different from the three cases we described above. In
both of them the opponent's strategy tracking is good in the first 250 iterations,
but afterwards these two examples have the opposite behaviour. In the example
where $\lambda_0 = 0.95$ the opponent's strategy is correctly approximated when $\sigma_t(1) = 0$.
Because of the high weights of the previously observed actions, the likelihood
needs a large number of iterations to become constant and thus $\lambda$ becomes equal
to 1. Then the adaptive forgetting factor fictitious play process becomes identical
to classic fictitious play and fails to adapt to the second jump. When $\lambda_0$ is equal
to 0.99 adaptive forgetting factor fictitious play fails to adapt the estimation
of opponent's strategy to the first jump. But when the second jump occurs,
the likelihood of action 1 is small since action 2 is played for 500 consecutive
iterations, and a drop in the value of $\lambda$ is observed which results in adaptation
to the change of opponent's strategy.

Fig. 3. Evolution of $\lambda$ (left) and tracking of $\sigma_t(1)$ (right) when the true strategies
are mixed and $\lambda_0$ is fixed. The pre-specified strategy of the opponent and its
prediction are depicted as the red and blue line respectively.

Fig. 4. Evolution of $\lambda$ (left) and tracking of $\sigma_t(1)$ (right) when the true strategies
are pure and $\gamma$ is fixed. The pre-specified strategy of the opponent and its
prediction are depicted as the red and blue line respectively.

Fig. 5. Evolution of $\lambda$ and tracking of $\sigma_t(1)$ when the true strategies are pure and
$\lambda_0$ is fixed. The pre-specified strategy of the opponent and its prediction are
depicted as the red and blue line respectively.

By taking into account the above results we observe that $\gamma = 10^{-4}$ and
$0.8 < \lambda_0 < 0.9$ lead to useful approximations when we consider cases where
both smooth and abrupt changes in opponent's strategy are possible.
In the remainder of the article we set $\gamma = 10^{-4}$ and $\lambda_0 = 0.8$.



5 Results 

5.1 Climbing hill game 

We initially compared the performance of the proposed algorithm with the
results of geometric and classic stochastic fictitious play in a three player climbing
hill game. This game, which is depicted in Table 2, generalises the climbing hill
game that was presented in [26] and exhibits a long best response path from the
risk-dominant joint mixed action (D,D,U) to the Nash equilibrium.

We present the results of 1000 replications of a learning episode of 1000 
iterations for each game. For each replication of 1000 iterations we computed 
the mean payoff. After the end of the 1000 replications the overall mean of the 
1000 payoff means was computed. 

The value of the learning parameter $z$ of geometric fictitious play was set
to 0.1. We selected this value of $z$ on the premise that the algorithm has the
U M D    U M D    U M D
U    -300 70 80    100 -300 90
M    50 40    -300 60
D    30
U    M    D

Table 2. Climbing hill game with three players. Player 1 selects rows, Player 2 selects
columns, and Player 3 selects the matrix. The global reward depicted in the matrices
is received by all players. The unique Nash equilibrium is in bold.



best results in the tracking experiment with a pre-specified opponent strategy
that we used in Section 4 to select the parameters of AFFFP. Thus we will
use this learning rate in all the simulations that we present in the rest of this
article. For all algorithms we used smooth best responses (8) with the randomisation
parameter $\xi$ in the smooth best response function equal to 1, allowing the same
randomisation for all algorithms.
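
As a rough illustration of the two ingredients named above, the sketch below
pairs a geometric belief update with learning parameter z and a smooth best
response with randomisation parameter $\xi$. Equation (8) is not reproduced in
this section, so the logit (Boltzmann) form used here is an assumption, as are
the function names; the geometric update is written in its commonly stated
form, where old beliefs are discounted by (1 - z).

```python
import math

def geometric_update(belief, observed_action, z=0.1):
    """Discount the old belief by (1 - z) and put weight z on the observed action."""
    return [(1.0 - z) * prob + (z if action == observed_action else 0.0)
            for action, prob in enumerate(belief)]

def smooth_best_response(expected_payoffs, xi=1.0):
    """Logit choice probabilities (assumed form of the smooth best response)."""
    best = max(expected_payoffs)  # subtract the maximum for numerical stability
    weights = [math.exp((u - best) / xi) for u in expected_payoffs]
    total = sum(weights)
    return [w / total for w in weights]
```

With $\xi = 1$ the response keeps noticeable randomisation between actions
with similar expected payoffs, while small values of $\xi$ approach a pure
best response.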

Adaptive forgetting factor fictitious play performed better than both
geometric and stochastic fictitious play. The overall mean global payoff was 95.26
for AFFFP whereas the respective payoffs for geometric and stochastic fictitious
play were 91.7 and 70.3. Stochastic fictitious play did not converge to the Nash
equilibrium after 1000 replications. The proposed variation of fictitious play
also outperforms geometric fictitious play in terms of speed of convergence.
This can be seen if we reduce the iterations of the game to 200.
Then the overall mean payoff of AFFFP is 90.12, whereas for geometric fictitious
play it is 63.12. This is because adaptive forgetting factor fictitious play requires
approximately 100 iterations to reach the Nash equilibrium, whereas geometric
fictitious play needs at least 300. This difference is depicted in Figure 6.




Fig. 6. Probability of playing the (U,U,D) equilibrium for one run of each of AFFFP 
(blue dot line), geometric fictitious play (black solid line) and stochastic fictitious play 
(green diamond line) for the three player climbing hill game. 
5.2 Vehicle target assignment game 

We also compared the performance of the proposed variation of fictitious play
against the results of geometric fictitious play in the vehicle target assignment
game that is described in [7]. In this game agents should coordinate to achieve
a common goal, which is to maximise the total value of the targets that are
destroyed. In particular, in a specific area we place I vehicles and J targets. For
each vehicle i, its available actions are simply the targets that are available to
engage. Each vehicle can choose only one target to engage but a target can be
engaged by many vehicles. The probability that player i destroys a target j
is $p_{ij}$ if it chooses to engage target j. We assume that the probability each agent
has to destroy a target is independent of the actions of the other agents, and that
the target is destroyed if any one agent successfully destroys it, so the probability
that a target j is destroyed by the vehicles that engage it is
$1 - \prod_{i : s^i = j} (1 - p_{ij})$. Each of the targets has a different value $V_j$.
The expected utility that is produced from target j is the product of its value
$V_j$ and the probability that it is destroyed by the vehicles that engage it.
More formally we can express the utility that is produced from target j as:

$$U_j(s) = V_j \Big(1 - \prod_{i : s^i = j} (1 - p_{ij})\Big) \qquad (20)$$

The global utility is then the sum of the utilities of each target:

$$u_g(s) = \sum_j U_j(s). \qquad (21)$$
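
The target and global utilities can be sketched in a few lines of Python. The
data layout is an assumption made for illustration: s[i] is the index of the
target chosen by vehicle i, V[j] is the value of target j, and p[i][j] is the
probability that vehicle i destroys target j.

```python
from math import prod

def target_utility(j, s, V, p):
    """Value of target j times the probability that at least one engaging vehicle destroys it."""
    survive = prod(1.0 - p[i][j] for i, target in enumerate(s) if target == j)
    return V[j] * (1.0 - survive)

def global_utility(s, V, p):
    """Sum of the target utilities over all targets."""
    return sum(target_utility(j, s, V, p) for j in range(len(V)))
```

A target that no vehicle engages contributes zero, because the empty product
equals one.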

Wonderful life utility was used to evaluate each vehicle's payoff. The
utility that a vehicle i receives after engaging a target j, $s^i = j$, is

$$u_i(s^i \mid s^{-i}) = u_g(s^i \mid s^{-i}) - u_g(s_0^i \mid s^{-i}) \qquad (22)$$

where $s_0^i$ was set to be the greedy action of player i: $s_0^i = \arg\max_j V_j p_{ij}$.
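
A minimal sketch of a wonderful life utility with a greedy reference action is
given below. It assumes the usual form of this utility, namely the global
utility of the joint action minus the global utility obtained when vehicle i
alone is switched to its reference action; the helper names and the data
layout (s[i], V[j], p[i][j]) are hypothetical.

```python
from math import prod

def global_utility(s, V, p):
    """Sum over targets of value times probability of destruction."""
    total = 0.0
    for j, value in enumerate(V):
        survive = prod(1.0 - p[i][j] for i, target in enumerate(s) if target == j)
        total += value * (1.0 - survive)
    return total

def greedy_action(i, V, p):
    """Target with the largest expected value for vehicle i acting alone."""
    return max(range(len(V)), key=lambda j: V[j] * p[i][j])

def wonderful_life_utility(i, s, V, p):
    """Global utility minus the global utility with vehicle i on its greedy target (assumed form)."""
    baseline = list(s)
    baseline[i] = greedy_action(i, V, p)
    return global_utility(s, V, p) - global_utility(baseline, V, p)
```

Under this assumed form a vehicle that already plays its greedy action
receives zero, and any other action is scored by how much it changes the
global utility relative to that baseline.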

In our simulations we used thirty vehicles and thirty targets that were placed
uniformly at random in a unit square. The probability that a vehicle i destroys
a target j is proportional to the inverse of its distance from this target, $1/d_{ij}$.
The values of the targets are independently sampled from a uniform distribution
on [0, 100].
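
One way such an instance could be generated is sketched below. The text only
states that the destruction probability is proportional to $1/d_{ij}$, so the
proportionality constant `scale` and the clipping of probabilities at one are
assumptions made purely for illustration, as are the function and argument names.

```python
import math
import random

def make_instance(n_vehicles=30, n_targets=30, scale=0.1, seed=0):
    """Random vehicle target assignment instance in the unit square."""
    rng = random.Random(seed)
    vehicles = [(rng.random(), rng.random()) for _ in range(n_vehicles)]
    targets = [(rng.random(), rng.random()) for _ in range(n_targets)]
    # Target values are drawn independently and uniformly from [0, 100].
    values = [rng.uniform(0.0, 100.0) for _ in range(n_targets)]
    # Assumption: p_ij = min(1, scale / d_ij); the constant is hypothetical.
    p = [[min(1.0, scale / max(math.dist(v, t), 1e-12)) for t in targets]
         for v in vehicles]
    return values, p
```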

The vehicles had to "negotiate" with the other vehicles (players) for a fixed
number of negotiation steps before they chose a target to engage. A negotiation
step begins with each player choosing a target to engage and ends with the
agents exchanging this information with the others and updating their beliefs
about their opponents' strategies based on this information. The target that
each vehicle chooses in the game is its action at the final negotiation
step.

Figure 7 depicts the average results for 100 instances of the game for the
two algorithms, AFFFP and geometric fictitious play. For each instance, both
algorithms ran for 100 negotiation steps. To be able to average across the 100
instances we normalise the scores of an instance by the highest observed score
for that instance (since some instances will have greater available score than
others). As in the strategic form game, we set the randomisation parameter $\xi$ in
the smooth best response function equal to 1 for both algorithms.
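
The normalisation step can be sketched as follows. The layout of the input is
hypothetical: each instance is a dictionary that maps an algorithm name to its
list of scores per negotiation step, and the highest observed score is taken
over both algorithms and all steps of that instance, which is one reading of
the description above.

```python
def average_normalised(runs):
    """Average per-step scores across instances after per-instance normalisation."""
    n_steps = len(next(iter(runs[0].values())))
    averages = {alg: [0.0] * n_steps for alg in runs[0]}
    for run in runs:
        # Highest score observed in this instance, over all algorithms and steps.
        best = max(max(series) for series in run.values())
        for alg, series in run.items():
            for t, score in enumerate(series):
                averages[alg][t] += score / best / len(runs)
    return averages
```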

In Figure 7 we observe that AFFFP results in a better solution on average
than geometric fictitious play. Furthermore geometric fictitious play needs more
iterations than AFFFP to reach the area where its reward is maximised.




Fig. 7. Utility of AFFFP (dotted line) and geometric fictitious play (dashed line) for 
the vehicle target assignment game. 



5.3 Disaster management scenario 

Finally we test our algorithm in a disaster management scenario as described in
[27]. Consider the case where a natural disaster has happened (an earthquake
for example) and because of this $N_I$ simultaneous incidents have occurred in
different areas of a town. In each incident j, a different number of people $N_p(j)$
are injured. The town has a specific number of ambulances $N_{amb}$ available that
are able to collect the injured people. An ambulance i can be at the area of
incident j in time $T_{ij}$ and has capacity $c_i$. We will assume that the total
capacity of the ambulances is larger than the number of injured people. Our aim
is to allocate the ambulances to the incidents in such a way that the average time
$\frac{1}{N_{amb}} \sum_{j=1}^{N_I} \sum_{i : s^i = j} T_{ij}$ that the ambulances need to
reach the incidents is minimised while all the people that are involved in the
incidents are saved. We can then formulate this scenario as follows. Each of the
$N_{amb}$ players should choose one of the $N_I$ incidents as its action. The utility
to the system of an allocation is:

$$u_g(s) = -\frac{1}{N_{amb}} \sum_{j=1}^{N_I} \sum_{i : s^i = j} T_{ij} - \sum_{j=1}^{N_I} \max\Big(0,\; N_p(j) - \sum_{i : s^i = j} c_i\Big) \qquad (23)$$
where $i = 1, \ldots, N_{amb}$ and $s^i$ is the action of player i. The emergency units
have to "negotiate" with each other and choose the incident which they will be
allocated to, using a variant of fictitious play.
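
The system utility can be sketched in Python as below. The sketch assumes
that the utility is the negative of the cost, that is, the negative of the
average travel time plus the number of people left uncollected, so that
larger values are better. The function name and data layout (s[i] is the
incident chosen by ambulance i, T[i][j] the travel time, capacity[i] the
capacity of ambulance i, injured[j] the number of injured at incident j) are
hypothetical.

```python
def system_utility(s, T, capacity, injured):
    """Negative of (average travel time + number of uncollected people)."""
    n_amb = len(s)
    mean_time = sum(T[i][s[i]] for i in range(n_amb)) / n_amb
    penalty = 0
    for j, n_people in enumerate(injured):
        # Total capacity of the ambulances allocated to incident j.
        collected = sum(capacity[i] for i in range(n_amb) if s[i] == j)
        penalty += max(0, n_people - collected)
    return -(mean_time + penalty)
```

Because each travel time lies between zero and one in the simulations, a
single uncollected person outweighs the whole time component.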

The first component of the utility function expresses the first aim, to allocate
the ambulances to the incidents as fast as possible. Thus the agents have to choose
an incident with small $T_{ij}$. The second objective, which is to save all the injured
people that are involved in an incident, is expressed as the second component
of the utility function. It is a penalty factor that adds to the average time the
number of people that could not be saved. As in the vehicle target
assignment game, each player can choose only one incident to help, but
more than one player can provide help at each incident.

We follow [27] and consider simulations with 3 and 5 incidents, and 10, 15
and 20 available ambulances. We run 200 trials for each of the combinations of
ambulances and incidents, for each algorithm. Since this scenario is NP-complete
[27] our aim is not to find the optimal solution, but to reach a sufficient or
near optimal solution. Furthermore the algorithm we present here is "any-time",
since the system utility generally increases as time goes on, and therefore
interruption before termination results in good, if not optimal, actions. In each
of the 200 trials the time that an ambulance needs to reach an incident is
a random number uniformly distributed between zero and one. The capacity
of each ambulance is an integer uniformly distributed between one and four.
Finally the total number of injured people that are involved in each incident 
is a uniformly distributed integer between an d jfc, where c t is the total 

capacity of the emergency units, $c_t = \sum_{i=1}^{N_{amb}} c_i$.

In each trial we allow 200 negotiation steps. In this scenario, because of the
structure of the utility function, a large randomisation parameter $\xi$ in the
smooth best response function can easily lead to unnecessary randomisation. For
that reason we set $\xi$ to 0.01 for both algorithms, which results in a decision rule
that approximates best response. The learning rate and the initial value of $\lambda$ in
adaptive forgetting factor fictitious play are set to $10^{-4}$ and 0.8 respectively.

We use the same performance measures as [27] to test the performance of our
algorithms. We compared the solution of our algorithm against the centralised
solution, which can be obtained using binary integer programming. In particular
we compared the solution of our algorithm against the one we obtain by using
Matlab's bintprog algorithm, which uses a branch and bound algorithm that is
based on linear programming relaxation [28-30]. To compare the results of these
two algorithms we use the ratio $f_{fp}/f_{opt}$, where $f_{fp}$ is the utility that the
agents gain if they use the variations of fictitious play we propose and $f_{opt}$ is
the utility that the agents gain if they use the solution of bintprog.
Thus values of the ratio smaller than one mean that the proposed variations of
fictitious play perform better than bintprog, and values of the ratio larger than
one mean that the proposed variations of fictitious play perform worse than
bintprog. Furthermore we measured the percentage of the instances in which all
the casualties are rescued, and the overall percentage of people that are rescued.
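
The three measures can be computed from per-trial records as in the sketch
below. The record layout is hypothetical: each trial stores the number of
injured people per incident, the total ambulance capacity allocated to each
incident, and the two values $f_{fp}$ and $f_{opt}$. Pooling people across
trials for the percentage saved and averaging the ratio over trials are both
assumptions about how the measures are aggregated.

```python
def performance_measures(trials):
    """Return (% of trials with everyone collected, % of people saved, mean ratio)."""
    n = len(trials)
    complete = sum(
        all(cap >= people for people, cap in zip(t["injured"], t["capacity"]))
        for t in trials
    )
    # People saved at an incident cannot exceed the injured or the allocated capacity.
    saved = sum(min(people, cap)
                for t in trials
                for people, cap in zip(t["injured"], t["capacity"]))
    total = sum(sum(t["injured"]) for t in trials)
    mean_ratio = sum(t["f_fp"] / t["f_opt"] for t in trials) / n
    return 100.0 * complete / n, 100.0 * saved / total, mean_ratio
```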
                               % complete   % saved   f_fp/f_opt
3 incidents   10 ambulances       82.0       92.84      1.2702
              15 ambulances       77.5       89.88      1.2624
              20 ambulances       81.0       90.59      1.2058
5 incidents   10 ambulances       91.0       98.54      1.6631
              15 ambulances       90.5       98.28      1.5088
              20 ambulances       83.0       93.6       1.4251

Table 3. Results of adaptive forgetting factor fictitious play after 200 negotiation steps
for the three performance measures.







                               % complete   % saved   f_fp/f_opt
3 incidents   10 ambulances       95.5       99.74      1.2970
              15 ambulances       74.5       88.24      1.2965
              20 ambulances       60.55      87.9       1.2779
5 incidents   10 ambulances       94.5       99.68      1.8587
              15 ambulances       79.0       88.28      1.7443
              20 ambulances       48.50      86.54      1.8545

Table 4. Results of geometric fictitious play after 200 negotiation steps for the three
performance measures.



Tables 3 and 4 present the results we have obtained in the last step of ne- 
gotiations between the ambulances for the disaster management scenario when 
they use adaptive forgetting factor and geometric fictitious play respectively, to 
coordinate. 

The total percentage of the people that were saved and the ratio $f_{fp}/f_{opt}$
were similar within the groups of 3 and 5 incidents when the adaptive forgetting
factor fictitious play algorithm was used. Regarding the percentage of the trials
in which all people were saved, we can observe that as we increase the complexity
of the scenario, that is the number of ambulances, the performance of adaptive
forgetting factor fictitious play decreases.

When we compare the results of the two algorithms we can observe that, in
both the 3 and the 5 incident cases, adaptive forgetting factor fictitious
play performs better than geometric fictitious play when the scenarios included
more ambulances, and therefore were more complicated. In particular, in the case
of 20 ambulances the difference in the number of cases
where all the casualties were collected from the incidents was greater than 20%.

The differences we can observe from bintprog's centralised solution, for both
algorithms, can be explained by the structure of the utility function (23). The
first component of the utility is a number between zero and one since it is the
average of the times, $T_{ij}$, that the ambulances need to reach the incidents. On
the other hand the penalty factor, even in the cases where only one person is not
collected from the incidents, is greater than the first component of the utility.
Thus a local search algorithm like the variations of fictitious play we propose
initially searches for an allocation that collects all the injured people, so that the
penalty component of the utility is zero, and only afterwards for the allocation
that also minimises the average time that the ambulances need to reach the
incidents. It is therefore easy to become stuck in local optima.

We have also examined how the results are influenced by the number of
iterations we use in each of the 200 trials of the game. For that reason we have
compared the results that we would have obtained if in each instance of the
simulations we had stopped the negotiations between the emergency units after
50, 100, 150 and 200 iterations for both algorithms. Tables 5-7 and 8-10 depict
the results for adaptive forgetting factor fictitious play and geometric fictitious
play respectively.







                                      Iterations
                                50      100     150     200
3 incidents   10 ambulances    74.5    81.0    82.0    82.0
              15 ambulances    73.0    75.5    77.5    77.5
              20 ambulances    73.5    78.0    80.0    81.0
5 incidents   10 ambulances    80.0    87.5    90.0    91.0
              15 ambulances    76.5    84.5    89.5    90.5
              20 ambulances    62.0    77.0    80.0    83.0

Table 5. Percentage of solutions in which the capacity of the ambulance in every
incident was enough to cover all injured people for different stopping times of the
negotiations, 50, 100, 150 and 200 iterations of the adaptive forgetting factor fictitious
play algorithm.







                                      Iterations
                                50      100     150     200
3 incidents   10 ambulances    91.44   92.35   92.95   92.84
              15 ambulances    88.65   90.28   89.97   89.88
              20 ambulances    89.09   89.33   90.44   90.59
5 incidents   10 ambulances    96.39   97.80   98.51   98.54
              15 ambulances    94.84   97.22   98.03   98.29
              20 ambulances    87.61   91.90   92.81   93.61

Table 6. Average percentage of injured people collected for different stopping times
of the negotiations, 50, 100, 150 and 200 iterations of the adaptive forgetting factor
fictitious play algorithm.



We can see from Tables 5-7 that the performance of adaptive forgetting factor
fictitious play, in all the measures that we used, is similar for 100, 150 and 200
negotiation steps. In particular when we consider the percentage of the instances
                                      Iterations
                                50       100      150      200
3 incidents   10 ambulances    1.2791   1.2603   1.2669   1.2702
              15 ambulances    1.2701   1.2452   1.2319   1.2624
              20 ambulances    1.2142   1.1971   1.1943   1.2058
5 incidents   10 ambulances    1.6989   1.6827   1.6772   1.6631
              15 ambulances    1.5569   1.5352   1.5304   1.5088
              20 ambulances    1.5298   1.4413   1.4306   1.4251

Table 7. Average percentage of the ratio $f_{fp}/f_{opt}$ for different stopping times of the
negotiations, 50, 100, 150 and 200 iterations of the adaptive forgetting factor fictitious
play algorithm.







                                      Iterations
                                50      100     150     200
3 incidents   10 ambulances    94.0    94.0    94.7    95.5
              15 ambulances    71.0    73.0    73.3    74.5
              20 ambulances    60.5    61.3    62.0    63.0
5 incidents   10 ambulances    86.0    92.0    93.3    94.5
              15 ambulances    79.0    78.0    79.0    82.0
              20 ambulances    42.0    49.0    47.3    48.5

Table 8. Percentage of solutions in which the capacity of the ambulance in every
incident was enough to cover all injured people for different stopping times of the
negotiations, 50, 100, 150 and 200 iterations of the geometric fictitious play algorithm.







                                      Iterations
                                50      100     150      200
3 incidents   10 ambulances    99.62   99.65   99.69    99.74
              15 ambulances    88.20   88.21   88.21    88.24
              20 ambulances    86.65   87.41   88.23    89.90
5 incidents   10 ambulances    97.20   99.54   99.61    99.68
              15 ambulances    94.84   97.22   98.03    98.29
              20 ambulances    85.33   86.38   86.391   86.54

Table 9. Average percentage of injured people collected for different stopping times
of the negotiations, 50, 100, 150 and 200 iterations of the geometric fictitious play
algorithm.
                                      Iterations
                                50       100      150      200
3 incidents   10 ambulances    1.2957   1.2928   1.3039   1.2970
              15 ambulances    1.2965   1.3039   1.2801   1.2740
              20 ambulances    1.2779   1.2744   1.2723   1.2722
5 incidents   10 ambulances    1.8587   1.8540   1.8550   1.8519
              15 ambulances    1.7443   1.7205   1.7085   1.7074
              20 ambulances    1.8545   1.7845   1.7744   1.7727

Table 10. Average percentage of the ratio $f_{fp}/f_{opt}$ for different stopping times of the
negotiations, 50, 100, 150 and 200 iterations of the geometric fictitious play algorithm.



in which the ambulances collected all the injured people and the ratio
$f_{fp}/f_{opt}$, the difference in the results after 100 and 200 negotiation steps is
between 1% and 4.5%. The differences become even smaller for the percentage of
the people that were saved, where they were less than 2%.

Geometric fictitious play was trapped in the area of a local minimum after a
few iterations, since the results are similar after 50, 100, 150 and 200 iterations.
This is reflected in the results, where geometric fictitious play performed worse
than adaptive forgetting factor fictitious play, especially in the complicated cases
where the negotiations were between 20 ambulances.

Adaptive forgetting factor fictitious play also performed better than the
Random Neural Network (RNN) presented in [27], when we consider the percentage
of the cases in which all the injured people are collected and the overall
percentage of people that are rescued. The percentage of instances where the
allocations proposed by the RNN could collect all the casualties was from 25 to 69
percent. The corresponding results of adaptive forgetting factor fictitious play are
from 77.5 to 94.5 percent. The overall percentage of people that are rescued by
the RNN algorithm is similar to that of adaptive forgetting factor fictitious play,
between 85 and 98.5 percent. The ratio reported by [27] is better than that
shown here. However in [27] only the examples in which all the casualties were
collected were included to evaluate the ratio. Cases with high penalties were
therefore excluded from the ratio evaluation, since uncollected casualties
introduce higher penalties than an inefficient allocation. This artificially
improves their metric, especially when one considers that in many instances less
than 40% of their solutions were included.

6 Conclusions 

Fictitious play is a classic learning algorithm in games, but it is founded on an
(incorrect) stationarity assumption. Therefore we have introduced a variation of
fictitious play, adaptive forgetting factor fictitious play, which addresses this
problem by giving higher weights to the recently observed actions using a heuristic
rule from the streaming data literature.
We examined the impact of the adaptive forgetting factor fictitious play
parameters $\lambda$ and $\gamma$ on the results of the algorithm. We showed that these two
parameters should be chosen carefully since there are combinations of $\lambda$ and $\gamma$
that induce very poor results. An example of such a combination is when high values
of the learning rate $\gamma$ are combined with low values of $\lambda_0$. This is because
values of $\lambda < 0.6$ assign small weights to the previously observed actions and this
results in volatile estimates that are influenced by opponents' randomisation. High
values of the learning rate $\gamma$ mean that $\lambda_0$ is driven still lower, exacerbating
the problem further. From the simulation results we have seen that a satisfactory
combination of parameters $\lambda_0$ and $\gamma$ is $0.8 \le \lambda_0 \le 0.9$ and $\gamma = 10^{-4}$.

Adaptive forgetting factor fictitious play performed better than the competitor
algorithms in the climbing hill game. Moreover it converged to a better solution
than geometric fictitious play in the vehicle target assignment game. In the disaster
management scenario the performance of the proposed variation of fictitious play
compared favorably with that of geometric fictitious play and a pre-planning
algorithm that uses neural networks [27].

Our empirical observations indicate that adaptive forgetting factor fictitious
play converges to a solution that is at least as good as that given by the
competitor algorithms. Hence, by slightly increasing the computational intensity of
fictitious play, less communication is required between agents to quickly
coordinate on a desirable solution.